Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale by Jay M. Patel

Author: Jay M. Patel
Language: eng
Format: epub
ISBN: 9781484265765
Publisher: Apress


from sklearn.metrics import silhouette_samples

y_km = km.fit_predict(X_train_text)

pd.Series(y_km).value_counts().to_dict()

# Output

{1: 395, 7: 292, 4: 289, 6: 242, 2: 158, 3: 149, 0: 145, 5: 110}

Listing 4-42. K-means clustering
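The listing above uses `km` and `X_train_text` without showing where they come from in this excerpt. The following is a minimal, self-contained sketch of one plausible setup, not the book's actual code: the toy corpus, the `TfidfVectorizer` step, and the `KMeans` parameters (`n_clusters=2`, `n_init=10`, `random_state=0`) are all assumptions chosen for illustration. It also shows one way the imported `silhouette_samples` could be used to score each document's fit to its cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples

# Hypothetical toy corpus standing in for the scraped documents.
docs = [
    "python pandas dataframe tutorial",
    "pandas dataframe groupby python",
    "python dataframe merge pandas",
    "football match goal score",
    "football league goal season",
    "goal score football team",
]

# Assumed vectorization step: TF-IDF weights, one row per document.
vectorizer = TfidfVectorizer()
X_train_text = vectorizer.fit_transform(docs)

# Assumed model settings; the excerpt's output suggests 8 clusters on the real data.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
y_km = km.fit_predict(X_train_text)

# Number of documents assigned to each cluster, as in the listing above.
cluster_sizes = pd.Series(y_km).value_counts().to_dict()
print(cluster_sizes)

# Per-document silhouette coefficient, between -1 (poor fit) and 1 (good fit).
# The matrix is densified here only because the toy corpus is tiny.
sil = silhouette_samples(X_train_text.toarray(), y_km)
print(sil.round(2))
```

On a real corpus, averaging the silhouette values per cluster gives a rough indication of which clusters are tight and which are loosely grouped.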

Since the preceding clusters have a fairly balanced number of members, we check the top terms per cluster in Listing 4-43. We add the cluster numbers as a column to the document-term matrix dataframe and filter it to show the documents from one cluster at a time. For each filtered dataframe, we sum the token weights, transpose the result, and sort it in descending order to display the top 30 terms in that cluster.

df_dtm["cluster_name"] = y_km

df_dtm.head()

cluster_list = len(df_dtm['cluster_name'].unique())

for cluster_number in range(cluster_list):
    print("*"*20)
    print("Cluster %d: " % cluster_number)
    df_cl = df_dtm[df_dtm['cluster_name'] == cluster_number]
    df_cl = df_cl.drop(columns = 'cluster_name')
    print("Total documents in cluster: ", len(df_cl))
    print()
    df_sum = df_cl.agg(['sum'])
    df_sum = df_sum.transpose()
    df_sum_transpose_sort_descending= df_sum.sort_values(by = 'sum', ascending = False)
    df_sum_transpose_sort_descending.index.name = 'words'
    df_sum_transpose_sort_descending.reset_index(inplace=True)
    print(','.join(df_sum_transpose_sort_descending.words.iloc[:30].tolist()))

# Output

********************

Cluster 0:

Total documents in cluster: 145
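The filter, sum, transpose, and sort steps described above can also be expressed with a single `groupby`. The sketch below is an alternative formulation rather than the book's listing: the helper name `top_terms_per_cluster`, its parameters, and the small document-term matrix are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical toy document-term matrix: rows are documents, columns are token weights.
df_dtm = pd.DataFrame(
    {
        "python": [0.9, 0.8, 0.0, 0.0],
        "pandas": [0.4, 0.5, 0.0, 0.1],
        "football": [0.0, 0.0, 0.7, 0.9],
        "goal": [0.0, 0.1, 0.6, 0.3],
    }
)
df_dtm["cluster_name"] = [0, 0, 1, 1]


def top_terms_per_cluster(dtm, cluster_col="cluster_name", n_terms=30):
    """Return {cluster label: top terms ordered by summed token weight}."""
    # Sum token weights over all documents in each cluster.
    sums = dtm.groupby(cluster_col).sum()
    # For each cluster, sort terms by total weight and keep the first n_terms.
    return {
        cluster: row.sort_values(ascending=False).head(n_terms).index.tolist()
        for cluster, row in sums.iterrows()
    }


top_terms = top_terms_per_cluster(df_dtm, n_terms=2)
for cluster, terms in top_terms.items():
    print("Cluster %d: %s" % (cluster, ",".join(terms)))
```

Grouping once avoids re-filtering the dataframe for every cluster and does not assume that cluster labels run consecutively from zero.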


